Skip to content

Implement dynamic gang header sizes #17004

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

pcd1193182
Copy link
Contributor

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Motivation and Context

ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with.

Description

The ZDB changes are first; basically, there are just some small tweaks to make it easier to work with gang blocks. First, the compact blkptr printer now notes which DVAs have the gang bit set. There is also a fix of a bug that's been around since 2009; ZDB gang block header printing has been broken since then. The problem is that if you do a zio_read of a BP with the gang bit set, you don't get back the header, you get back the underlying data. The fix is to just not set the gang bit.

The way dynamic sized gang headers work is that the amount of space we allow ourselves to use for the gang header is equal to the size of the smallest possible allocation on the vdevs we got back when we allocated it. This is necessary to work around the fact that the ASIZE for a gang BP isn't the allocated size of the gang header, but of the entire tree under it. Because of that fact, when reading, claiming, or freeing, the allocated size of a gang header must be able to be determined from no other information than the vdev it was allocated on. We reuse the existing vdev_gang_header_asize for this. We take this minimum space, and pack the block full of blkptrs, up to a tail with an embedded checksum. This allows us to store many more gang children per header, leading to much shallower gang trees if we're forced into intense ganging.

One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in zio_checksum_error, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.

Much of the remaining changes are just tweaking the gang tree orchestration logic to work with a variable number of gang block pointers. The final interesting change is to the logic that determines the size of the gang children. When only 3 gang children were possible, it didn't really matter what the sector size was; you were going to be allocating at most 3 of them, so the space waste was limited. With large sector sizes, however, it could get expensive: A 128k allocation that gangs on a system with 64k sectors will, without changes, consume 256 64k sectors, with 512 bytes of data each, wasting 128x as much space as the original allocation. The fix is to use the spa_min_ashift rather than the MINBLOCKSIZE when calculating the lsize.

How Has This Been Tested?

In addition to the ZTS and zloop, extensive manual tests were performed, verifying that 1) the new ZDB functionality works, and 2) that larger gang headers are correctly used and allocated when applicable. Mixed ashift testing occured, as well as the full compatibility matrix of old pools, new pools without the feature enabled running against old and new code.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

zfeature_register(SPA_FEATURE_DYNAMIC_GANG_HEADER,
"com.klarasystems:dynamic_gang_header", "dynamic_gang_header",
"Support for dynamically sized gang headers",
ZFEATURE_FLAG_ACTIVATE_ON_ENABLE, ZFEATURE_TYPE_BOOLEAN, NULL,
Copy link
Member

@amotin amotin Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I also thought about this gang block issue, it also immediately became obvious that it would require read-incompatible pool feature. I am very unhappy about it. It would require updates to at least GRUB and FreeBSD loader to not break booting or careful compatibility settings from every user and distribution. The "activate on enable" makes it even worse, while I bet in many cases it could be avoided, for example for any pool with at least one ashift=9 vdev.

Copy link
Contributor Author

@pcd1193182 pcd1193182 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACTIVATE_ON_ENABLE can certainly be avoided. If nothing else, we could have a check in zio_write_gang_block that only activates it if we get allocations back that let us use a header of size > 512. In theory, we could modify the feature refcount for every single gang block created and freed, but that feels like a bit much, since each time you need to create a dmu_tx_t, assign it to the io's txg, etc. Doing it once you actually write a new-style gang block for the first time and never deactivating it should mean that for most pools, it never gets activated, and for those with vdevs with 512 byte sectors, it also wouldn't get activated.

read-incompatible, on the other hand, is definitely unavoidable. For now at least, people can avoid enabling the feature on their root pools, so the bootloader issue isn't as pressing.

Copy link
Member

@amotin amotin Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"avoid activating the feature" means never running zpool upgrade for existing pools, or setting right compatibility property on creation or at least before upgrading. At this time we don't even support disabling features that were enabled by mistake and that are not active. We should implement that at some point BTW. This might be a good motivation, if we accept the incompatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out a slightly bigger hammer than expected is needed here, so we definitely don't want to update this for every single gang header. It turns out that 1) I forgot that this featureflag needed the MOS flag and 2) MOS features basically have to be updated from syncing context. This means we have to spin out a synctask to do the job for us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "activate on enabled" is likely risky here, because that would mean someone without upgraded bootcode would work fine until one day they birthed a gang block, and now it doesn't boot.

Explicitly requiring a zpool upgrade to enable the feature feels far safer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"avoid activating the feature" means never running zpool upgrade for existing pools, or setting right compatibility property on creation or at least before upgrading. At this time we don't even support disabling features that were enabled by mistake and that are not active. We should implement that at some point BTW. This might be a good motivation, if we accept the incompatibility.

I think this is possible with zhack, but doing it as a real feature (zpool set feature@name=disabled $pool) does seem useful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest version the feature isn't ACTIVATE_ON_ENABLE anymore. In addition, even when it was, you would still need a zpool upgrade to enable the feature before it would get enabled (and therefore activated), right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, although most people blindly zpool upgrade

@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from 2ff3989 to 9843def Compare February 4, 2025 21:13
@pcd1193182
Copy link
Contributor Author

Here's a prototype of the changes to do variable leaf size allocations and use those for gang allocations.

It has three primary parts. The first is changes to the allocation code to support allocation ranges, instead of specific sizes. The second is the concept of a "preallocated" write zio, which uses the normal write code path but skips parts of it since the allocation has already been done. The third part is changes to the gang creation code to take advantage of both of the above.

I also added some tests for gang related code, since the test suite doesn't have that much in the way of those.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach to variable size allocation, though code may use some polish. But in pre-allocated writes you've lost me. I haven't looked very close yet, but if we allocate space upfront, I wonder why it can't be implemented with existing rewrite zios?

More comments below, but I also wondering if we could implement variable size allocations first within existing gang blocks without introducing the new pool feature. I think it could help significantly by itself without breaking compatibility.

module/zfs/zio.c Outdated
Comment on lines 3248 to 3264
zio->io_bp = pio->io_bp;
*zio->io_bp = zio->io_bp_orig = *bp;

if (zio->io_abd != NULL)
abd_free(zio->io_abd);
zio->io_orig_abd = zio->io_abd = pio->io_abd;

zio->io_size = zio->io_orig_size = allocated_size;
zio->io_lsize = allocated_size;
zio->io_done = NULL;
zio->io_orig_pipeline = zio->io_pipeline =
(zio->io_pipeline & ~ZIO_GANG_STAGES) |
ZIO_STAGE_WRITE_COMPRESS | ZIO_STAGE_DVA_ALLOCATE;
zio->io_flags |= ZIO_FLAG_PREALLOCATED;
zp.zp_type = zp.zp_storage_type = pio->io_prop.zp_type;
zp.zp_level = pio->io_prop.zp_level;
zio->io_prop = zp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are gang blocks with only one child illegal? Because this mess with live zio conversion from rewrite to regular write IMHO should be. Alternatively I think could be cleaner to replace the rewrite zio pipeline with ZIO_INTERLOCK_STAGES and let it quietly complete without doing anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One-child gang blocks aren't illegal, but they are pretty gross. I didn't think this code was too bad, but I agree it's definitely not very clean. I'm fine stripping that out if people don't think it has value. I think just replacing it with the INTERLOCK stages will probably break the throttle accounting, but that will all change once your updated space accounting lands, so you may have a better idea about than I do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about this more, it might actually work fine with a few tweaks. Let me try it out and see.

@pcd1193182
Copy link
Contributor Author

Rather than implementing disabling features, which is a little tricky, I am playing with a lighter-weight solution here. This PR adds the idea of a feature that isn't enabled by upgrade by default. You can still enable it manually, and if it's in your compat file it will be enabled. There are some quirks here; I'm not sure exactly how to handle the code in upgrade that prints all un-enabled features; should it display this one, even though it won't be added by upgrade? Should it skip this one, even though it isn't enabled? For now, I left it as the former; there are some upgrade tests that had to be tweaked to work with this approach. I'm curious if anyone has feelings about this idea, or preferences for what the behavior should be for listing unenabled features.

@github-actions github-actions bot added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Mar 8, 2025
@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from f85a90b to 7a9513d Compare April 1, 2025 20:31
@pcd1193182
Copy link
Contributor Author

This is a rebased version of the changes; the first commit is the entirety of #17111 squashed down. I haven't run the tests or anything, just trying to keep this from getting too far out of sync with master.

@allanjude
Copy link
Contributor

re:

One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in zio_checksum_error, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.

Does the feature activation refcount stuff track the TXG of when a feature was activated? Would we be able to use the birth_txg of the gang block to tell if it should be 512b vs minimum asize? or does that involve looking inside the block we are trying to read?

@pcd1193182
Copy link
Contributor Author

Does the feature activation refcount stuff track the TXG of when a feature was activated? Would we be able to use the birth_txg of the gang block to tell if it should be 512b vs minimum asize? or does that involve looking inside the block we are trying to read?

I believe we only track the TXG at which the feature was enabled. Because we use the small gang header unless we can't fit the necessary leaves, we can't just rely on that. A gang block created after the feature was enabled might use the small size, in order to avoid activating the feature unnecessarily.

If we had activation_txg as well, we would also need to tweak the code to always use the larger header once the feature is active. Not a big change, but worth noting.

@pcd1193182 pcd1193182 force-pushed the dyn_gang branch 2 times, most recently from 7dafb0d to 4fe393c Compare April 23, 2025 19:29
Paul Dagnelie added 5 commits May 8, 2025 11:13
ZFS gang block headers are currently fixed at 512 bytes. This is
increasingly wasteful in the era of larger disk sector sizes. This PR
allows any size allocation to work as a gang header. It also contains
supporting changes to ZDB to make gang headers easier to work with.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
@pcd1193182 pcd1193182 marked this pull request as ready for review May 8, 2025 18:13
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels May 8, 2025
Signed-off-by: Paul Dagnelie <[email protected]>
@tonyhutter
Copy link
Contributor

For now, I left it as the former; there are some upgrade tests that had to be tweaked to work with this approach. I'm curious if anyone has feelings about this idea, or preferences for what the behavior should be for listing unenabled features.

One other idea would be to have "creation-only feature flags" that are only enabled at pool creation time. You could never upgrade to these features on existing pools, nor would they be listed on the "zpool upgrade" list. The benefit is we could safely enable it by default (on new pools), and it removes the need for "optional feature upgrades", and might simplify some of the codepaths.

Sure, you wouldn't be able to upgrade existing pools, but that's an ephemeral cost. Over time, the fact that it's on by default for 100% of new pools will make up for it.


One other question - does the code work in such a way that we will always know if SPA_FEATURE_DYNAMIC_GANG_HEADER is active before we read/write our first data block? Or is it possible that we may read some gang blocks in early import before we know if SPA_FEATURE_DYNAMIC_GANG_HEADER is active?

@pcd1193182
Copy link
Contributor Author

One other idea would be to have "creation-only feature flags" that are only enabled at pool creation time. You could never upgrade to these features on existing pools, nor would they be listed on the "zpool upgrade" list. The benefit is we could safely enable it by default (on new pools), and it removes the need for "optional feature upgrades", and might simplify some of the codepaths.

My worry for this case would be that someone creates a boot pool with the new feature while unknowingly using an old bootloader that doesn't support it yet. It usually wouldn't be a huge issue (they'll find out when they boot the system onto the root pool for the first time), but it's an irritation. It is also valuable to be able to upgrade existing pools, since there are probably plenty of people out there with 4k pools who would like to use larger gang block headers without having to recreate their setups.

That said, creation-only flags are probably simpler than this design. If people feel strongly about it I'm not opposed to switching.

One other question - does the code work in such a way that we will always know if SPA_FEATURE_DYNAMIC_GANG_HEADER is active before we read/write our first data block? Or is it possible that we may read some gang blocks in early import before we know if SPA_FEATURE_DYNAMIC_GANG_HEADER is active?

I believe the way this works is that the label config contains a features_for_read list that stores all the label features the system needs to support in order to import the pool. During spa_add we load them into spa_label_features, and then in spa_ld_select_uberblock we check those features for support. That happens before we actually issue any reads, so we should never run into problems where we try to read something before we know if the feature is active.

I think this is a general problem that had to be solved for any read-incompatible mos-required feature (e.g. hole birth, embedded BPs, device removal, draid, etc), so there shouldn't be anything special here.

@behlendorf behlendorf self-requested a review May 21, 2025 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants